This chapter describes Oracle Big Data SQL.
Attention
The instructions in this chapter are for installing and using Oracle Big Data SQL 2.0 only.
Do not install version 2.0, since there is no upgrade path to 3.0.
Do not use this chapter as a reference for Oracle Big Data SQL 3.0.
See the Oracle Big Data SQL 3.0 User's Guide for the Oracle Big Data SQL 3.0 patch number, information on Big Data SQL changes, and 3.0 usage instructions.
If version 2.0 is already installed, you must uninstall it in order to install version 3.0. The Oracle Big Data SQL 3.0 patch README file includes steps for removing version 2.0.
Oracle Big Data SQL supports queries against vast amounts of big data stored in multiple data sources, including Apache Hive, HDFS, Oracle NoSQL Database, and Apache HBase. You can view and analyze data from various data stores together, as if it were all stored in an Oracle database.
Using Oracle Big Data SQL, you can query data stored in a Hadoop cluster using the complete SQL syntax. You can execute the most complex SQL SELECT statements against data in Hadoop, either manually or using your existing applications, to tease out the most significant insights. For example, users of the Oracle Advanced Analytics database option can apply their data mining models, which reside in Oracle Database, to data that is resident on Oracle Big Data Appliance.
The following sections provide further details:
Oracle Big Data SQL provides external tables with next-generation performance gains. An external table is an Oracle Database object that identifies and describes the location of data outside of a database. You can query an external table using the same SQL SELECT syntax that you use for any other database table.
External tables use access drivers to parse the data outside the database. Each type of external data requires a unique access driver. This release of Oracle Big Data SQL includes two access drivers for big data: one for accessing data stored in Apache Hive, and the other for accessing data stored in Hadoop Distributed File System (HDFS) files.
By querying external tables, you can access data stored in HDFS and Hive tables as if that data was stored in tables in an Oracle database. Oracle Database accesses the data by using the metadata provided when the external table was created.
Oracle Database 12.1.0.2 supports two new access drivers for Oracle Big Data SQL:
ORACLE_HIVE: Enables you to create Oracle external tables over Apache Hive data sources. Use this access driver when you already have Hive tables defined for your HDFS data sources. ORACLE_HIVE can also access data stored in other locations, such as HBase, that have Hive tables defined for them.
ORACLE_HDFS: Enables you to create Oracle external tables directly over files stored in HDFS. This access driver uses Hive syntax to describe a data source, assigning default column names of COL_1, COL_2, and so forth. You do not need to create a Hive table manually as a separate step.
Instead of acquiring the metadata from a Hive metadata store the way that ORACLE_HIVE does, the ORACLE_HDFS access driver acquires all of the necessary information from the access parameters. The ORACLE_HDFS access parameters are required to specify the metadata, and are stored as part of the external table definition in Oracle Database.
Oracle Big Data SQL uses these access drivers to optimize query performance.
External tables do not have traditional indexes, so that queries against them typically require a full table scan. However, Oracle Big Data SQL extends SmartScan capabilities, such as filter-predicate offloads, to Oracle external tables with the installation of Exadata storage server software on Oracle Big Data Appliance. This technology enables Oracle Big Data Appliance to discard a huge portion of irrelevant data—up to 99 percent of the total—and return much smaller result sets to Oracle Exadata Database Machine. End users obtain the results of their queries significantly faster, as the direct result of a reduced load on Oracle Database and reduced traffic on the network.
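For example, with a hypothetical external table weblogs_hdfs over HDFS log files, the filter predicate in a query like the following can be evaluated on the storage side, so only matching rows travel to the database:
-- Hypothetical table and columns, for illustration only.
-- The WHERE predicate can be offloaded to Oracle Big Data Appliance,
-- so rows with other status codes are discarded before transmission.
SELECT client_ip, request_url
FROM weblogs_hdfs
WHERE http_status = 404;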
See Also:
Oracle Database Concepts for a general introduction to external tables and pointers to more detailed information in the Oracle Database documentation library
Oracle Big Data SQL maintains Storage Index automatically, transparently to Oracle Database. A Storage Index contains a summary of the data distribution on disk for data stored in HDFS. Storage Index reduces the cost of I/O operations and the CPU cost of converting data from flat files into Oracle Database blocks.
Storage Index can be used only with external tables that are based on HDFS and created using either the ORACLE_HDFS driver or the ORACLE_HIVE driver. Storage Index cannot be used with external tables that use storage handlers, such as Apache HBase and Oracle NoSQL Database.
Storage Index is a collection of in-memory region indexes, and each region index stores summaries for up to 32 columns. There is one region index for each split. The content stored in one region index is independent of the other region indexes. This makes them highly scalable, and avoids latch contention.
Storage Index maintains the minimum and maximum values of the columns of a region for each region index. The minimum and maximum values are used to eliminate unnecessary I/O, also known as I/O filtering. The cell XT granule IO bytes saved by storage index statistic, available in the V$SYSSTAT view, shows the number of bytes of I/O saved by using Storage Index.
See Also:
Oracle® Database Reference for information about the V$SYSSTAT view
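As a quick check, you can query this statistic from the database. The following sketch uses a LIKE pattern rather than assuming the exact statistic name:
SQL> SELECT name, value
     FROM v$sysstat
     WHERE name LIKE 'cell XT granule%storage index%';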
Queries using the following comparisons are improved by the Storage Index:
Equality (=)
Inequality (<, !=, or >)
Less than or equal (<=)
Greater than or equal (>=)
IS NULL
IS NOT NULL
A Storage Index is built automatically after the Oracle Big Data SQL service receives a query with a comparison predicate that is greater than the maximum or less than the minimum value for the column in a region.
Note:
The effectiveness of Storage Index can be improved by ordering the rows in a table based on the columns that frequently appear in the WHERE clause. Storage Index works with any non-linguistic data type, and treats linguistic data types the same as non-linguistic ones.
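For example, if most queries filter on a date column, a one-time reorganization along that column keeps each region's minimum and maximum ranges narrow. A minimal sketch, with hypothetical table and column names:
-- Rebuild the Hive source table ordered by the frequently filtered
-- column, so each region's min/max range excludes more queries.
CREATE TABLE sales_ordered AS
SELECT * FROM sales
ORDER BY sale_date;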
Example 6-1 Elimination of Disk I/O with Storage Index
Consider a table whose values range from 1 to 8, divided into two sets of rows, each covered by a region index. One region index stores a minimum of 1 and a maximum of 5; the other stores a minimum of 3 and a maximum of 8.
For a query such as SELECT * FROM TABLE WHERE B < 2, only the first set of rows can match. Disk I/O for the second set of rows is eliminated because its minimum and maximum values do not satisfy the WHERE clause of the query.
Example 6-2 Improved Join Performance Using Storage Index
Using Storage Index allows table joins to skip unnecessary I/O operations. For example, the following query would perform an I/O operation and apply a Bloom filter to only the first block of the fact table.
SELECT count(*) from fact, dim where fact.m=dim.m and dim.name='Hard drive'
The I/O for the second block of the fact table is completely eliminated by Storage Index as its minimum/maximum range (5,8) is not present in the Bloom filter.
Oracle Big Data Appliance already provides numerous security features to protect data stored in a CDH cluster on Oracle Big Data Appliance:
Kerberos authentication: Requires users and client software to provide credentials before accessing the cluster.
Apache Sentry authorization: Provides fine-grained, role-based authorization to data and metadata.
HDFS Transparent Encryption: Protects the data on disk and at rest. Data encryption and decryption is transparent to applications using the data.
Oracle Audit Vault and Database Firewall monitoring: The Audit Vault plug-in on Oracle Big Data Appliance collects audit and logging data from MapReduce, HDFS, and Oozie services. You can then use Audit Vault Server to monitor these services on Oracle Big Data Appliance.
Oracle Big Data SQL adds the full range of Oracle Database security features to this list. You can apply the same security policies and rules to your Hadoop data that you apply to your relational data.
Oracle Big Data SQL 2.0 is available only on Oracle Exadata Database Machine connected to Oracle Big Data Appliance. You must install the Oracle Big Data SQL software on Oracle Big Data Appliance as well as on all Exadata nodes.
Oracle Big Data Appliance Release 4.4 includes Oracle Big Data SQL 2.0 as a Mammoth installation option.
Note:
The instructions below are for installing Oracle Big Data SQL 2.0. Oracle Big Data SQL 3.0 is also available for Oracle Big Data Appliance 4.4. Skip the installation below if you intend to install 3.0, since 3.0 requires that you first uninstall 2.0. See the Oracle Big Data SQL 3.0 User's Guide for the 3.0 patch number.
Oracle Exadata Database Machine must be configured on the same InfiniBand subnet as Oracle Big Data Appliance and connected to it over the InfiniBand network.
Table 6-1 shows the prerequisite software versions for installing each Oracle Big Data SQL release.
Table 6-1 Oracle Big Data SQL Compatibility Matrix
Oracle Big Data SQL Release | Oracle Big Data Appliance Releases | Oracle Exadata Storage Servers | Oracle Database Version | One-Off Patch |
---|---|---|---|---|
3.0 | 4.4 | 12.1.2.1.0 or later | Oracle Database 12.1.0.2 or greater, with January 2016 BP (12.1.0.2.160119) | One-off patch 22778199 on top of the January 2016 BP. |
2.0 | 4.4, 4.3, 4.2 | 12.1.2.1.0 or later | Oracle Database 12.1.0.2.10 or 12.1.0.2.13 | One-off patch 21419192 is available on top of Oracle Database 12.1.0.2.10 and 12.1.0.2.13. |
1.1 | 4.2, 4.1 | 12.1.2.1.0 or later | Oracle Database 12.1.0.2.4 or later | One-off patch 20310880 is available on top of Oracle Database 12.1.0.2.4. Not needed for 12.1.0.2.6 or later. |
Each Oracle Big Data Appliance release includes a version of Oracle Big Data SQL in the bundle as an installation option. You can also install a newer version of Oracle Big Data SQL if a patch is available, as shown in Table 6-2.
Table 6-2 Bundled Oracle Big Data SQL and Patches for Oracle Big Data Appliance
Oracle Big Data SQL Release | Oracle Big Data Appliance Releases |
---|---|
3.0 | 4.4 (patch available); no patch available for releases 4.3 and 4.2 |
2.0 | 4.4 (2.0 included), 4.3 (2.0 included), 4.2 (patch available) |
1.1 | 4.2 (1.1 included), 4.1 (patch available) |
Take the following steps to install the Oracle Big Data SQL software on Oracle Big Data Appliance and Oracle Exadata Database Machine.
Note:
Do not use this procedure with Bundle Patches 11 and 12. These patches are not supported for use with Big Data SQL 2.0.
You can use Cloudera Manager to verify that Oracle Big Data SQL is up and running. See "Managing Oracle Big Data SQL".
When you are done, if the cluster is secured by Kerberos then there are additional steps you must perform on both the cluster nodes and on the Oracle Exadata Database Machine. See Enabling Oracle Big Data SQL Access to a Kerberized Cluster.
If the Oracle Big Data Appliance was upgraded to version 4.3.0 and Big Data SQL had already been installed, then the upgrade process implicitly upgrades Oracle Big Data SQL to version 2.0. However, after the Oracle Big Data Appliance upgrade, the customer is still responsible for upgrading the Oracle Database to a supported level before re-running the post-installation script.
Important:
Run bds-exa-install.sh on every node of the Exadata cluster. If this is not done, you will see RPC connection errors when the BDS service is started.
To run the Oracle Big Data SQL post-installation script:
Copy the bds-exa-install.sh installation script from Oracle Big Data Appliance to a temporary directory on each Oracle Exadata Database Machine. You can find the script on the node where Mammoth is installed, typically the first node in the cluster. For example:
wget http://bda1node07/bda/bds-exa-install.sh
Verify the name of the Oracle installation owner and set the executable bit for this user. Typically, the oracle user owns the installation. For example:
$ ls -l bds-exa-install.sh
$ chown oracle:oinstall bds-exa-install.sh
$ chmod +x bds-exa-install.sh
Set the following environment variables:
$ORACLE_HOME to <database home>
$ORACLE_SID to <correct db SID>
$GI_HOME to <correct grid home>
Note:
You can set the grid home through the install script, as described in step 5(d), instead of setting $GI_HOME in this step.
Check that TNS_ADMIN points to the directory where the correct listener.ora is running. If the listener is in the default TNS_ADMIN location, $ORACLE_HOME/network/admin, then there is no need to define TNS_ADMIN. But if the listener is in a non-default location, TNS_ADMIN must correctly point to it. Use the command:
export TNS_ADMIN=<path to listener.ora>
Perform this step only if the ORACLE_SID is in uppercase; otherwise, proceed to the next step. The install script derives the CRS database resource name from ORACLE_SID only when it is in lowercase. If the ORACLE_SID is in uppercase, perform the following steps to pass the resource name to the script manually:
Run the following command to list all the resources.
$ crsctl stat res -t
From the output, note the ora.<dbresource>.db resource name.
Run the following command to verify that the correct ora.<dbresource>.db resource name is returned:
$ ./crsctl stat res ora.<dbresource>.db
The output displays the resource name as follows:
NAME=ora.<dbresource>.db
TYPE=ora.database.type
TARGET=ONLINE , ONLINE
STATE=ONLINE on <name01>, ONLINE on <name02>
Specify --db-name=<dbresource> as an additional argument to the install script:
./bds-exa-install.sh --db-name=<dbresource>
Additionally, you can set the grid home here, instead of setting $GI_HOME as described in step 3, by adding it to the above command:
./bds-exa-install.sh --db-name=<dbresource> --grid-home=<grid home>
Note:
If you ran the install script in this step, you can skip the next step.
As oracle (or another Oracle installation owner), run the script:
./bds-exa-install.sh
When prompted, you must run the generated root script as root in a separate session so that the installation can proceed as the oracle user. For example:
$ ./bds-exa-install.sh
bds-exa-install: root shell script : /u01/app/oracle/product/12.1.0.2/dbhome_1/install/bds-root-<cluster-name>-setup.sh
please run as root: /u01/app/oracle/product/12.1.0.2/dbhome_1/install/bds-root-<rack-name>-clu-setup.sh
A sample output is shown here:
bds-exa-install: platform is Linux
bds-exa-install: setup script started at : Sun Feb 14 20:06:17 PST 2016
bds-exa-install: bds version : bds-2.0-2.el6.x86_64
bds-exa-install: bda cluster name : mycluster1
bds-exa-install: bda web server : mycluster1bda16.us.oracle.com
bds-exa-install: cloudera manager url : mycluster1bda18.us.oracle.com:7180
bds-exa-install: hive version : hive-1.1.0-cdh5.5.1
bds-exa-install: hadoop version : hadoop-2.6.0-cdh5.5.1
bds-exa-install: bds install date : 02/14/2016 12:00 PST
bds-exa-install: bd_cell version : bd_cell-12.1.2.0.100_LINUX.X64_160131-1.x86_64
bds-exa-install: action : setup
bds-exa-install: crs : true
bds-exa-install: db resource : orcl
bds-exa-install: database type : SINGLE
bds-exa-install: cardinality : 1
bds-exa-install: root shell script : /u03/app/oracle/product/12.1.0/dbhome_1/install/bds-root-mycluster1-setup.sh
please run as root: /u03/app/oracle/product/12.1.0/dbhome_1/install/bds-root-mycluster1-setup.sh
waiting for root script to complete, press <enter> to continue checking.. q<enter> to quit
bds-exa-install: root script seem to have succeeded, continuing with setup bds
bds-exa-install: working directory : /u03/app/oracle/product/12.1.0/dbhome_1/install
bds-exa-install: downloading JDK
bds-exa-install: working directory : /u03/app/oracle/product/12.1.0/dbhome_1/install
bds-exa-install: installing JDK tarball
bds-exa-install: working directory : /u03/app/oracle/product/12.1.0/dbhome_1/bigdatasql/jdk1.8.0_66/jre/lib/security
bds-exa-install: Copying JCE policy jars
/bin/mkdir: cannot create directory `bigdata_config/mycluster1': File exists
bds-exa-install: working directory : /u03/app/oracle/product/12.1.0/dbhome_1/bigdatasql/jlib
bds-exa-install: removing old oracle bds jars if any
bds-exa-install: downloading oracle bds jars
bds-exa-install: installing oracle bds jars
bds-exa-install: working directory : /u03/app/oracle/product/12.1.0/dbhome_1/bigdatasql
bds-exa-install: downloading : hadoop-2.6.0-cdh5.5.1.tar.gz
bds-exa-install: downloading : hive-1.1.0-cdh5.5.1.tar.gz
bds-exa-install: unpacking : hadoop-2.6.0-cdh5.5.1.tar.gz
bds-exa-install: unpacking : hive-1.1.0-cdh5.5.1.tar.gz
bds-exa-install: working directory : /u03/app/oracle/product/12.1.0/dbhome_1/bigdatasql/hadoop-2.6.0-cdh5.5.1/lib
bds-exa-install: downloading : cdh-ol6-native.tar.gz
bds-exa-install: creating /u03/app/oracle/product/12.1.0/dbhome_1/bigdatasql/hadoop_mycluster1.env for hdfs/mapred client access
bds-exa-install: working directory : /u03/app/oracle/product/12.1.0/dbhome_1/bigdatasql
bds-exa-install: creating bds property files
bds-exa-install: working directory : /u03/app/oracle/product/12.1.0/dbhome_1/bigdatasql/bigdata_config
bds-exa-install: created bigdata.properties
bds-exa-install: created bigdata-log4j.properties
bds-exa-install: creating default and cluster directories needed by big data external tables
bds-exa-install: note this will grant default and cluster directories to public!
catcon: ALL catcon-related output will be written to /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_catcon_29579.lst
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon*.log files for output generated by scripts
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_*.lst files for spool files, if any
catcon.pl: completed successfully
bds-exa-install: granted default and cluster directories to public!
bds-exa-install: mta set to use listener end point : EXTPROC1521
bds-exa-install: mta will be setup
bds-exa-install: creating /u03/app/oracle/product/12.1.0/dbhome_1/hs/admin/initbds_orcl_mycluster1.ora
bds-exa-install: mta setting agent home as : /u03/app/oracle/product/12.1.0/dbhome_1/hs/admin
bds-exa-install: mta shutdown : bds_orcl_mycluster1
bds-exa-install: registering crs resource : bds_orcl_mycluster1
bds-exa-install: using dependency db resource of orcl
bds-exa-install: starting crs resource : bds_orcl_mycluster1
CRS-2672: Attempting to start 'bds_orcl_mycluster1' on 'mycluster1bda09'
CRS-2676: Start of 'bds_orcl_mycluster1' on 'mycluster1bda09' succeeded
NAME=bds_orcl_mycluster1
TYPE=generic_application
TARGET=ONLINE
STATE=ONLINE on mycluster1bda09
bds-exa-install: patching view LOADER_DIR_OBJS
catcon: ALL catcon-related output will be written to /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_catcon_30123.lst
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon*.log files for output generated by scripts
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_*.lst files for spool files, if any
catcon.pl: completed successfully
bds-exa-install: creating mta dblinks
bds-exa-install: cluster name : mycluster1
bds-exa-install: extproc sid : bds_orcl_mycluster1
bds-exa-install: cdb : true
catcon: ALL catcon-related output will be written to /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_dbcluster_dropdblink_catcon_30153.lst
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_dbcluster_dropdblink*.log files for output generated by scripts
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_dbcluster_dropdblink_*.lst files for spool files, if any
catcon.pl: completed successfully
catcon: ALL catcon-related output will be written to /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_default_dropdblink_catcon_30179.lst
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_default_dropdblink*.log files for output generated by scripts
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_default_dropdblink_*.lst files for spool files, if any
catcon.pl: completed successfully
catcon: ALL catcon-related output will be written to /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_dbcluster_createdblink_catcon_30205.lst
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_dbcluster_createdblink*.log files for output generated by scripts
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_dbcluster_createdblink_*.lst files for spool files, if any
catcon.pl: completed successfully
catcon: ALL catcon-related output will be written to /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_default_createdblink_catcon_30231.lst
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_default_createdblink*.log files for output generated by scripts
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_default_createdblink_*.lst files for spool files, if any
catcon.pl: completed successfully
catcon: ALL catcon-related output will be written to /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_catcon_30257.lst
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon*.log files for output generated by scripts
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_*.lst files for spool files, if any
catcon.pl: completed successfully
catcon: ALL catcon-related output will be written to /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_catcon_30283.lst
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon*.log files for output generated by scripts
catcon: See /u03/app/oracle/product/12.1.0/dbhome_1/install/bdscatcon_*.lst files for spool files, if any
catcon.pl: completed successfully
bds-exa-install: setup script completed all steps
For additional details see "Running the bds-exa-install Script".
If you have a multi-instance database, repeat step 6 for each database instance.
When the script completes, the following items are available, and Oracle Big Data SQL is running on the database instance. However, if events cause the Oracle Big Data SQL agent to stop, then you must restart it. See "Starting and Stopping the Big Data SQL Agent".
Oracle Big Data SQL directory and configuration, with JAR, environment, and properties files.
Database dba_directories.
Database dblinks.
Database big data spfile parameter.
For example, you can verify the dba_directories from the SQL prompt as follows:
SQL> select * from dba_directories where directory_name like '%BIGDATA%';
The bds-exa-install script generates a custom installation script that is run by the owner of the Oracle home directory. That secondary script installs all the files needed by Oracle Big Data SQL into the $ORACLE_HOME/bigdatasql directory. For Oracle NoSQL Database support, it installs the client library (kvclient.jar). It also creates the database directory objects and the database links for the multithreaded Oracle Big Data SQL agent.
Alternatively, you can use the --generate-only option to create the secondary script, and then run it as the owner of $ORACLE_HOME.
If you encounter problems running the install script on Exadata, collect the following information and open an SR with Oracle Support with the details:
Collect the debug output by running the script in debug mode, as follows:
$ ./bds-exa-install.sh --db-name=<dbresource> --grid-home=<grid home> --root-script=false --debug
or
$ ./bds-exa-install.sh --root-script=false --debug
Collect the Oracle Database version, as follows:
Collect the result of opatch lsinventory from the RDBMS (RAC) home.
Collect the result of opatch lsinventory from the Grid home.
Collect the result of the following SQL statement to confirm that Datapatch is set up:
SQL> select patch_id, patch_uid, version, bundle_series, bundle_id, action, status from dba_registry_sqlpatch;
Collect the information from the following environment variables:
$ORACLE_HOME
$ORACLE_SID
$GI_HOME
$TNS_ADMIN
Collect the result of running the lsnrctl status command.
To give Oracle Big Data SQL access to HDFS data on a Kerberos-enabled cluster, make each Oracle Exadata Database Machine that needs access a Kerberos client. Also run kinit for the oracle account on each cluster node and Exadata Database Machine to ensure that the account is authenticated by Kerberos. This procedure is required in two situations:
When enabling Oracle Big Data SQL on a Kerberos-enabled cluster.
When enabling Kerberos on a cluster where Oracle Big Data SQL is already installed.
Note:
Oracle Big Data SQL queries run on the Hadoop cluster as the owner of the Oracle Database process (that is, the oracle user). Therefore, the oracle user needs a valid Kerberos ticket in order to access data. This ticket is required for every Oracle Database instance that accesses the cluster. A valid ticket is also needed for each Big Data SQL Server process running on Oracle Big Data Appliance. Run kinit oracle to obtain the ticket.
These steps enable the operating system user to authenticate with the kinit utility before submitting Oracle SQL Connector for HDFS jobs. The kinit utility typically uses a Kerberos keytab file for authentication without an interactive prompt for a password.
On each node of the cluster:
Log in as the oracle user.
Run kinit on the oracle account.
$ kinit oracle
Enter the Kerberos password.
Log on to the primary node and then stop and restart Oracle Big Data SQL.
$ bdacli stop big_data_sql_cluster
$ bdacli start big_data_sql_cluster
On all Oracle Exadata Database Machines that need access to the cluster:
Copy the Kerberos configuration file /etc/krb5.conf from the node where Mammoth is installed to the same path on each Oracle Exadata Database Machine.
Run kinit on the oracle account and enter the Kerberos password.
Re-run the Oracle Big Data SQL post-installation script:
$ ./bds-exa-install.sh
Avoiding Kerberos Ticket Expiration
The system should run kinit on a regular basis, before letting the Kerberos ticket expire, to enable Oracle SQL Connector for HDFS to authenticate transparently. Use cron or a similar utility to run kinit. For example, if Kerberos tickets expire every two weeks, then set up a cron job to renew the ticket weekly.
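A minimal sketch of such a cron entry follows. The keytab path, principal, and schedule are hypothetical and must match your site's Kerberos configuration:
# Hypothetical crontab entry for the oracle user: renew the ticket
# weekly (Sunday, 2:00 a.m.) from a keytab, with no password prompt.
0 2 * * 0 /usr/bin/kinit -k -t /home/oracle/oracle.keytab oracle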
You can easily create an Oracle external table for data in Apache Hive. Because the metadata is available to Oracle Database, you can query the data dictionary for information about Hive tables. Then you can use a PL/SQL function to generate a basic SQL CREATE TABLE ORGANIZATION EXTERNAL statement. You can modify the statement before execution to customize the external table.
The DBMS_HADOOP PL/SQL package contains a function named CREATE_EXTDDL_FOR_HIVE. It returns the data definition language (DDL) to create an external table for accessing a Hive table. This function requires you to provide basic information about the Hive table:
Name of the Hadoop cluster
Name of the Hive database
Name of the Hive table
Whether the Hive table is partitioned
You can obtain this information by querying the ALL_HIVE_TABLES data dictionary view. It displays information about all Hive tables that you can access from Oracle Database.
This example shows that the current user has access to an unpartitioned Hive table named RATINGS_HIVE_TABLE in the default database. A user named JDOE is the owner.
SQL> SELECT cluster_id, database_name, owner, table_name, partitioned
     FROM all_hive_tables;

CLUSTER_ID   DATABASE_NAME  OWNER    TABLE_NAME          PARTITIONED
------------ -------------- -------- ------------------- --------------
hadoop1      default        jdoe     ratings_hive_table  UN-PARTITIONED
With the information from the data dictionary, you can use the CREATE_EXTDDL_FOR_HIVE function of DBMS_HADOOP. This example specifies a database table name of RATINGS_DB_TABLE in the current schema. The function returns the text of the CREATE TABLE command in a local variable named DDLout, but does not execute it.
DECLARE
   DDLout VARCHAR2(4000);
BEGIN
   dbms_hadoop.create_extddl_for_hive(
      CLUSTER_ID=>'hadoop1',
      DB_NAME=>'default',
      HIVE_TABLE_NAME=>'ratings_hive_table',
      HIVE_PARTITION=>FALSE,
      TABLE_NAME=>'ratings_db_table',
      PERFORM_DDL=>FALSE,
      TEXT_OF_DDL=>DDLout
   );
   dbms_output.put_line(DDLout);
END;
/
When this procedure runs, the PUT_LINE function displays the CREATE TABLE command:
CREATE TABLE ratings_db_table (
   c0 VARCHAR2(4000),
   c1 VARCHAR2(4000),
   c2 VARCHAR2(4000),
   c3 VARCHAR2(4000),
   c4 VARCHAR2(4000),
   c5 VARCHAR2(4000),
   c6 VARCHAR2(4000),
   c7 VARCHAR2(4000))
ORGANIZATION EXTERNAL
   (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS
      (
       com.oracle.bigdata.cluster=hadoop1
       com.oracle.bigdata.tablename=default.ratings_hive_table
      )
   )
PARALLEL 2 REJECT LIMIT UNLIMITED
You can capture this information in a SQL script, and use the access parameters to change the Oracle table name, the column names, and the data types as desired before executing it. You might also use access parameters to specify a date format mask.
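For example, a lightly edited version of the generated statement might rename the columns and tighten the data types. This is only a sketch; the new column names are hypothetical, and com.oracle.bigdata.colmap is used here to map them back to the original Hive column names:
CREATE TABLE ratings_db_table (
   customer_id  VARCHAR2(12),
   movie_id     VARCHAR2(12),
   rating       NUMBER(2))
ORGANIZATION EXTERNAL
   (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS
      (
       com.oracle.bigdata.cluster=hadoop1
       com.oracle.bigdata.tablename=default.ratings_hive_table
       com.oracle.bigdata.colmap=[{"col":"CUSTOMER_ID","field":"c0"}, \
                                  {"col":"MOVIE_ID","field":"c1"}, \
                                  {"col":"RATING","field":"c2"}]
      )
   )
PARALLEL 2 REJECT LIMIT UNLIMITED;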
The ALL_HIVE_COLUMNS view shows how the default column names and data types are derived. This example shows that the Hive column names are C0 to C7, and that the Hive STRING data type maps to VARCHAR2(4000):
SQL> SELECT table_name, column_name, hive_column_type, oracle_column_type
     FROM all_hive_columns;

TABLE_NAME            COLUMN_NAME  HIVE_COLUMN_TYPE  ORACLE_COLUMN_TYPE
--------------------- ------------ ----------------- ------------------
ratings_hive_table    c0           string            VARCHAR2(4000)
ratings_hive_table    c1           string            VARCHAR2(4000)
ratings_hive_table    c2           string            VARCHAR2(4000)
ratings_hive_table    c3           string            VARCHAR2(4000)
ratings_hive_table    c4           string            VARCHAR2(4000)
ratings_hive_table    c5           string            VARCHAR2(4000)
ratings_hive_table    c6           string            VARCHAR2(4000)
ratings_hive_table    c7           string            VARCHAR2(4000)

8 rows selected.
You can choose between using DBMS_HADOOP and developing a CREATE TABLE statement from scratch. In either case, you may need to set some access parameters to modify the default behavior of ORACLE_HIVE.
The following statement creates an external table named ORDER to access Hive data:
CREATE TABLE order (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    description VARCHAR2(100),
                    order_total NUMBER (8,2))
   ORGANIZATION EXTERNAL (TYPE oracle_hive);
Because no access parameters are set in the statement, the ORACLE_HIVE access driver uses the default settings to do the following:
Connects to the default Hadoop cluster.
Uses a Hive table named order. An error results if the Hive table does not have fields named CUST_NUM, ORDER_NUM, DESCRIPTION, and ORDER_TOTAL.
Sets the value of a field to NULL if there is a conversion error, such as a CUST_NUM value longer than 10 bytes.
You can set properties in the ACCESS PARAMETERS clause of the external table clause to override the default behavior of the access driver. The following clause includes the com.oracle.bigdata.overflow access parameter. When this clause is used in the previous example, data longer than 100 characters for the DESCRIPTION column is truncated, instead of throwing an error:
(TYPE oracle_hive
 ACCESS PARAMETERS (
    com.oracle.bigdata.overflow={"action":"truncate", "col":"DESCRIPTION"}
 ))
The next example sets most of the available parameters for ORACLE_HIVE:
CREATE TABLE order (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    order_date  DATE,
                    item_cnt    NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
   ORGANIZATION EXTERNAL
(TYPE oracle_hive
 ACCESS PARAMETERS (
    com.oracle.bigdata.tablename:  order_db.order_summary
    com.oracle.bigdata.colmap:     {"col":"ITEM_CNT", \
                                    "field":"order_line_item_count"}
    com.oracle.bigdata.overflow:   {"action":"TRUNCATE", \
                                    "col":"DESCRIPTION"}
    com.oracle.bigdata.erroropt:   [{"action":"replace", \
                                     "value":"INVALID_NUM", \
                                     "col":["CUST_NUM","ORDER_NUM"]}, \
                                    {"action":"reject", \
                                     "col":"ORDER_TOTAL"}]
 ))
The parameters make the following changes in the way that the ORACLE_HIVE access driver locates the data and handles error conditions:
com.oracle.bigdata.tablename: Handles differences in table names. ORACLE_HIVE looks for a Hive table named ORDER_SUMMARY in the ORDER_DB database.
com.oracle.bigdata.colmap: Handles differences in column names. The Hive ORDER_LINE_ITEM_COUNT field maps to the Oracle ITEM_CNT column.
com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.
com.oracle.bigdata.erroropt: Replaces bad data. Errors in the data for CUST_NUM or ORDER_NUM set the value to INVALID_NUM.
You can use the ORACLE_HIVE access driver to access data stored in Oracle NoSQL Database. However, you must first create a Hive external table that accesses the KVStore. Then you can create an external table in Oracle Database over it, similar to the process described in "Creating an Oracle External Table for Hive Data".
This section contains the following topics:
To provide access to the data in Oracle NoSQL Database, you create a Hive external table over the Oracle NoSQL table. Oracle Big Data SQL provides a storage handler named oracle.kv.hadoop.hive.table.TableStorageHandler that enables Hive to read the Oracle NoSQL Database table format.
The following is the basic syntax of a Hive CREATE TABLE statement for a Hive external table over an Oracle NoSQL table:
CREATE EXTERNAL TABLE tablename colname coltype[, colname coltype,...]
STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler'
TBLPROPERTIES (
   "oracle.kv.kvstore" = "database",
   "oracle.kv.hosts" = "nosql_node1:port[, nosql_node2:port...]",
   "oracle.kv.hadoop.hosts" = "hadoop_node1[,hadoop_node2...]",
   "oracle.kv.tableName" = "table_name");
Hive CREATE TABLE Parameters
tablename: The name of the Hive external table being created. This table name will be used in SQL queries issued in Oracle Database, so choose a name that is appropriate for users. The name of the external table that you create in Oracle Database must be identical to the name of this Hive table. Table, column, and field names are case insensitive in Oracle NoSQL Database, Apache Hive, and Oracle Database.
colname coltype: The names and data types of the columns in the Hive external table. See Table 6-3 for the data type mappings between Oracle NoSQL Database and Hive.
Hive CREATE TABLE TBLPROPERTIES Clause
oracle.kv.kvstore: The name of the KVStore. Only upper- and lowercase letters and digits are valid in the name.
oracle.kv.hosts: A comma-delimited list of host names and port numbers in the Oracle NoSQL Database cluster. Each string has the format hostname:port. Enter multiple names to provide redundancy in the event that a host fails.
oracle.kv.hadoop.hosts: A comma-delimited list of all host names in the CDH cluster in Oracle Big Data Appliance with Oracle Big Data SQL enabled.
oracle.kv.tableName: The name of the table in Oracle NoSQL Database that stores the data for this Hive external table.
See Also:
Apache Hive Language Manual DDL
Use the following syntax to create an external table in Oracle Database that can access the Oracle NoSQL data through a Hive external table:
CREATE TABLE tablename(colname colType[, colname colType...])
   ORGANIZATION EXTERNAL
      (TYPE ORACLE_HIVE DEFAULT DIRECTORY directory
       ACCESS PARAMETERS
          (access parameters)
      )
   REJECT LIMIT UNLIMITED;
In this syntax, you identify the column names and data types. For more about this syntax, see "About the SQL CREATE TABLE Statement".
When Oracle Big Data SQL retrieves data from Oracle NoSQL Database, the data undergoes two type conversions:
To a Hive data type when the data is read into the columns of the Hive external table.
To an Oracle data type when the data is read into the columns of an Oracle Database external table.
Table 6-3 identifies the supported Oracle NoSQL data types and their mappings to Hive and Oracle Database data types. Oracle Big Data SQL does not support the Oracle NoSQL complex data types Array, Map, and Record.
Table 6-3 Oracle NoSQL Database Data Type Mappings
Oracle NoSQL Database Data Type | Apache Hive Data Type | Oracle Database Data Type |
---|---|---|
String | STRING | VARCHAR2 |
Boolean | BOOLEAN | NUMBER (see footnote 1) |
Integer | INT | NUMBER |
Long | INT | NUMBER |
Double | DOUBLE | NUMBER(p,s) |
Float | FLOAT | NUMBER(p,s) |
This example uses the sample data provided with the Oracle NoSQL Database software:
Verify that the following files reside in the examples/hadoop/table directory:
create_vehicle_table.kvs
CountTableRows.java
LoadVehicleTable.java
This example runs on an Oracle Big Data Appliance server named bda1node07 and uses a KVStore named BDAKV.
To create and populate the sample table in Oracle NoSQL Database:
Open a connection to an Oracle NoSQL Database node on Oracle Big Data Appliance.
Create a table named vehicleTable. The following example uses the load command to run the commands in create_vehicle_table.kvs:
$ cd NOSQL_HOME
$ java -jar lib/kvcli.jar -host bda1node07 -port 5000 \
load -file examples/hadoop/table/create_vehicle_table.kvs
Compile LoadVehicleTable.java:
$ javac -cp examples:lib/kvclient.jar examples/hadoop/table/LoadVehicleTable.java
Execute the LoadVehicleTable class to populate the table:
$ java -cp examples:lib/kvclient.jar hadoop.table.LoadVehicleTable -host bda1node07 -port 5000 -store BDAKV
{"type":"auto","make":"Chrysler","model":"PTCruiser","class":"4WheelDrive","colo
r":"white","price":20743.240234375,"count":30}
{"type":"suv","make":"Ford","model":"Escape","class":"FrontWheelDrive","color":"
.
.
.
10 new records added
The vehicleTable table contains the following fields:
Field Name | Data Type |
---|---|
type | STRING |
make | STRING |
model | STRING |
class | STRING |
color | STRING |
price | DOUBLE |
count | INTEGER |
The following example creates a Hive table named VEHICLES that accesses vehicleTable in the BDAKV KVStore. Oracle Big Data Appliance is configured with a CDH cluster in the first six servers (bda1node01 to bda1node06) and an Oracle NoSQL Database cluster in the next three servers (bda1node07 to bda1node09).
CREATE EXTERNAL TABLE IF NOT EXISTS vehicles
   (type  STRING,
    make  STRING,
    model STRING,
    class STRING,
    color STRING,
    price DOUBLE,
    count INT)
COMMENT 'Accesses data in vehicleTable in the BDAKV KVStore'
STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler'
TBLPROPERTIES
   ("oracle.kv.kvstore" = "BDAKV",
    "oracle.kv.hosts" = "bda1node07.example.com:5000,bda1node08.example.com:5000",
    "oracle.kv.hadoop.hosts" = "bda1node01.example.com,bda1node02.example.com,bda1node03.example.com,bda1node04.example.com,bda1node05.example.com,bda1node06.example.com",
    "oracle.kv.tableName" = "vehicleTable");
The DESCRIBE command lists the columns in the VEHICLES table:
hive> DESCRIBE vehicles;
OK
type string from deserializer
make string from deserializer
model string from deserializer
class string from deserializer
color string from deserializer
price double from deserializer
count int from deserializer
A query against the Hive VEHICLES table returns data from the Oracle NoSQL vehicleTable table:
hive> SELECT make, model, class FROM vehicletable WHERE type='truck' AND color='red' ORDER BY make, model;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
.
.
.
Chrysler   Ram1500        RearWheelDrive
Chrysler   Ram2500        FrontWheelDrive
Ford       F150           FrontWheelDrive
Ford       F250           RearWheelDrive
Ford       F250           AllWheelDrive
Ford       F350           RearWheelDrive
GM         Sierra         AllWheelDrive
GM         Silverado1500  RearWheelDrive
GM         Silverado1500  AllWheelDrive
After you create the Hive table, the metadata is available in the Oracle Database static data dictionary views. The following SQL SELECT statement returns information about the Hive table created in the previous topic:
SQL> SELECT table_name, column_name, hive_column_type
     FROM all_hive_columns
     WHERE table_name='vehicles';

TABLE_NAME      COLUMN_NAME  HIVE_COLUMN_TYPE
--------------- ------------ ----------------
vehicles        type         string
vehicles        make         string
vehicles        model        string
vehicles        class        string
vehicles        color        string
vehicles        price        double
vehicles        count        int
The next SQL CREATE TABLE statement generates an external table named VEHICLES over the Hive VEHICLES table, using the ORACLE_HIVE access driver. The name of the table in Oracle Database must be identical to the name of the table in Hive. However, both Oracle NoSQL Database and Oracle Database are case insensitive.
CREATE TABLE vehicles
   (type  VARCHAR2(10),
    make  VARCHAR2(12),
    model VARCHAR2(20),
    class VARCHAR2(40),
    color VARCHAR2(20),
    price NUMBER(8,2),
    count NUMBER)
ORGANIZATION EXTERNAL
   (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS
      (com.oracle.bigdata.debug=true com.oracle.bigdata.log.opt=normal))
REJECT LIMIT UNLIMITED;
This SQL SELECT statement retrieves all rows for red trucks from vehicleTable in Oracle NoSQL Database:
SQL> SELECT make, model, class
     FROM vehicles
     WHERE type='truck' AND color='red'
     ORDER BY make, model;

MAKE         MODEL                CLASS
------------ -------------------- ---------------------
Chrysler     Ram1500              RearWheelDrive
Chrysler     Ram2500              FrontWheelDrive
Ford         F150                 FrontWheelDrive
Ford         F250                 AllWheelDrive
Ford         F250                 RearWheelDrive
Ford         F350                 RearWheelDrive
GM           Sierra               AllWheelDrive
GM           Silverado1500        RearWheelDrive
GM           Silverado1500        4WheelDrive
GM           Silverado1500        AllWheelDrive
You can also use the ORACLE_HIVE access driver to access data stored in Apache HBase. However, you must first create a Hive external table that accesses the HBase table. Then you can create an external table in Oracle Database over it. The basic steps are the same as those described in "Creating an Oracle External Table for Oracle NoSQL Database".
To provide access to the data in an HBase table, you create a Hive external table over it. Apache provides a storage handler and a SerDe that enable Hive to read the HBase table format.
The following is the basic syntax of a Hive CREATE TABLE statement for an external table over an HBase table:
CREATE EXTERNAL TABLE tablename colname coltype[, colname coltype,...]
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
   'serialization.format'='1',
   'hbase.columns.mapping'=':key,value:key,value:');
See Also:
Apache Hive Language Manual DDL
Hive HBase Integration at
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-StorageHandlers
Class HBaseSerDe in the Hive API reference at
http://hive.apache.org/javadocs/r0.13.1/api/hbase-handler/index.html
Use the following syntax to create an external table in Oracle Database that can access the HBase data through a Hive external table:
CREATE TABLE tablename(colname colType[, colname colType...])
ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(access parameters)
)
REJECT LIMIT UNLIMITED;
In this syntax, you identify the column names and data types. To specify the access parameters, see "About the SQL CREATE TABLE Statement".
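For example, assuming a Hive external table named movie_logs has already been created over an HBase table (all names and columns here are hypothetical), the Oracle external table could look like this:
CREATE TABLE movie_logs
   (cust_id  VARCHAR2(20),
    movie_id VARCHAR2(20),
    activity NUMBER,
    log_time VARCHAR2(40))
ORGANIZATION EXTERNAL
   (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS
      (com.oracle.bigdata.tablename=default.movie_logs)
   )
REJECT LIMIT UNLIMITED;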
The ORACLE_HDFS access driver enables you to access many types of data that are stored in HDFS, but which do not have Hive metadata. You can define the record format of text data, or you can specify a SerDe for a particular data format.
You must create the external table for HDFS files manually, and provide all the information the access driver needs to locate the data, and parse the records and fields. The following are some examples of CREATE TABLE ORGANIZATION EXTERNAL statements.
The following statement creates a table named ORDER to access the data in all files stored in the /usr/cust/summary directory in HDFS:
CREATE TABLE ORDER (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    order_total NUMBER (8,2))
   ORGANIZATION EXTERNAL
      (TYPE oracle_hdfs
       DEFAULT DIRECTORY DEFAULT_DIR
       LOCATION ('hdfs:/usr/cust/summary/*'));
Because no access parameters are set in the statement, the ORACLE_HDFS access driver uses the default settings to do the following:
Connects to the default Hadoop cluster.
Reads the files as delimited text, and the fields as type STRING.
Assumes that the number of fields in the HDFS files matches the number of columns (three in this example).
Assumes the fields are in the same order as the columns, so that CUST_NUM data is in the first field, ORDER_NUM data is in the second field, and ORDER_TOTAL data is in the third field.
Rejects any records in which a value causes a data conversion error: if the value for CUST_NUM exceeds 10 characters, the value for ORDER_NUM exceeds 20 characters, or the value of ORDER_TOTAL cannot be converted to NUMBER.
You can use many of the same access parameters with ORACLE_HDFS as with ORACLE_HIVE.
The following example is equivalent to the one shown in "Overriding the Default ORACLE_HIVE Settings". The external table accesses a delimited text file stored in HDFS.
CREATE TABLE order (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    order_date  DATE,
                    item_cnt    NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
   ORGANIZATION EXTERNAL
      (TYPE oracle_hdfs
       DEFAULT DIRECTORY DEFAULT_DIR
       ACCESS PARAMETERS (
          com.oracle.bigdata.colmap:   {"col":"item_cnt", "field":"order_line_item_count"}
          com.oracle.bigdata.overflow: {"action":"TRUNCATE", "col":"DESCRIPTION"}
          com.oracle.bigdata.erroropt: [{"action":"replace", \
                                         "value":"INVALID NUM", \
                                         "col":["CUST_NUM","ORDER_NUM"]}, \
                                        {"action":"reject", "col":"ORDER_TOTAL"}]
       )
       LOCATION ('hdfs:/usr/cust/summary/*'));
The parameters make the following changes in the way that the ORACLE_HDFS access driver locates the data and handles error conditions:
com.oracle.bigdata.colmap: Handles differences in column names. ORDER_LINE_ITEM_COUNT in the HDFS files matches the ITEM_CNT column in the external table.
com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.
com.oracle.bigdata.erroropt: Replaces bad data. Errors in the data for CUST_NUM or ORDER_NUM set the value to INVALID_NUM.
The next example uses a SerDe to access Avro container files.
CREATE TABLE order (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    order_date  DATE,
                    item_cnt    NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
   ORGANIZATION EXTERNAL
      (TYPE oracle_hdfs
       DEFAULT DIRECTORY DEFAULT_DIR
       ACCESS PARAMETERS (
          com.oracle.bigdata.rowformat: \
             SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
          com.oracle.bigdata.fileformat: \
             INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' \
             OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
          com.oracle.bigdata.colmap:   {"col":"item_cnt", "field":"order_line_item_count"}
          com.oracle.bigdata.overflow: {"action":"TRUNCATE", "col":"DESCRIPTION"}
       )
       LOCATION ('hdfs:/usr/cust/summary/*'));
The access parameters provide the following information to the ORACLE_HDFS access driver:
com.oracle.bigdata.rowformat: Identifies the SerDe that the access driver needs to use to parse the records and fields. The files are not in delimited text format.
com.oracle.bigdata.fileformat: Identifies the Java classes that can extract records and output them in the desired format.
com.oracle.bigdata.colmap: Handles differences in column names. ORACLE_HDFS matches ORDER_LINE_ITEM_COUNT in the HDFS files with the ITEM_CNT column in the external table.
com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.
The SQL CREATE TABLE statement has a clause specifically for creating external tables. The information that you provide in this clause enables the access driver to read data from an external source and prepare the data for the external table.
The following is the basic syntax of the CREATE TABLE statement for external tables:
CREATE TABLE table_name (column_name datatype,
                         column_name datatype[,...])
   ORGANIZATION EXTERNAL (external_table_clause);
You specify the column names and data types the same as for any other table. ORGANIZATION EXTERNAL identifies the table as an external table.
The external_table_clause identifies the access driver and provides the information that it needs to load the data. See "About the External Table Clause".
CREATE TABLE ORGANIZATION EXTERNAL takes the external_table_clause as its argument. It has the following subclauses:
TYPE clause
DEFAULT DIRECTORY clause
LOCATION clause
REJECT LIMIT clause
ACCESS PARAMETERS clause
See Also:
Oracle Database SQL Language Reference for the external_table_clause
The TYPE clause identifies the access driver. The type of access driver determines how the other parts of the external table definition are interpreted.
Specify one of the following values for Oracle Big Data SQL:
ORACLE_HDFS: Accesses files in an HDFS directory.
ORACLE_HIVE: Accesses a Hive table.
Note:
The ORACLE_DATAPUMP and ORACLE_LOADER access drivers are not associated with Oracle Big Data SQL.
The DEFAULT DIRECTORY clause identifies an Oracle Database directory object. The directory object identifies an operating system directory with files that the external table reads and writes.
ORACLE_HDFS and ORACLE_HIVE use the default directory solely to write log files on the Oracle Database system.
The LOCATION clause for ORACLE_HDFS contains a comma-separated list of file locations. The files must reside in the HDFS file system on the default cluster.
A location can be any of the following:
A fully qualified HDFS name, such as /user/hive/warehouse/hive_seed/hive_types. ORACLE_HDFS uses all files in the directory.
A fully qualified HDFS file name, such as /user/hive/warehouse/hive_seed/hive_types/hive_types.csv.
A URL for an HDFS file or a set of files, such as hdfs:/user/hive/warehouse/hive_seed/hive_types/*. Just a directory name is invalid.
The file names can contain any pattern-matching character described in Table 6-4.
Table 6-4 Pattern-Matching Characters
Character | Description |
---|---|
? | Matches any one character |
* | Matches zero or more characters |
[abc] | Matches one character in the set {a, b, c} |
[a-b] | Matches one character in the range {a...b}. The character must be less than or equal to b. |
[^a] | Matches one character that is not in the character set or range {a}. The carat (^) must immediately follow the left bracket, with no spaces. |
\c | Removes any special meaning of c. The backslash is the escape character. |
{ab\,cd} | Matches a string from the set {ab, cd}. The escape character (\) removes the meaning of the comma as a path separator. |
{ab\,c{de\,fh} | Matches a string from the set {ab, cde, cfh}. The escape character (\) removes the meaning of the comma as a path separator. |
Do not specify the LOCATION clause for ORACLE_HIVE; it raises an error. The data is stored in Hive, and the access parameters and the metadata store provide the necessary information.
The REJECT LIMIT clause limits the number of conversion errors permitted during a query of the external table before Oracle Database stops the query and returns an error.
Any processing error that causes a row to be rejected counts against the limit. The reject limit applies individually to each parallel query (PQ) process. It is not the total of all rejected rows for all PQ processes.
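For example, this sketch (with an abbreviated, hypothetical column list) permits up to 100 rejected rows per PQ process before the query fails:
CREATE TABLE order_summary (cust_num    VARCHAR2(10),
                            order_total NUMBER(8,2))
   ORGANIZATION EXTERNAL
      (TYPE oracle_hdfs
       DEFAULT DIRECTORY DEFAULT_DIR
       LOCATION ('hdfs:/usr/cust/summary/*'))
   REJECT LIMIT 100;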
The ACCESS PARAMETERS clause provides information that the access driver needs to load the data correctly into the external table. See "CREATE TABLE ACCESS PARAMETERS Clause".
When the access driver loads data into an external table, it verifies that the Hive data can be converted to the data type of the target column. If they are incompatible, then the access driver returns an error. Otherwise, it makes the appropriate data conversion.
Hive typically provides a table abstraction layer over data stored elsewhere, such as in HDFS files. Hive uses a serializer/deserializer (SerDe) to convert the data as needed from its stored format into a Hive data type. The access driver then converts the data from its Hive data type to an Oracle data type. For example, if a Hive table over a text file has a BIGINT column, then the SerDe converts the data from text to BIGINT. The access driver then converts the data from BIGINT (a Hive data type) to NUMBER (an Oracle data type).
Performance is better when one data type conversion is performed instead of two. The data types for the fields in the HDFS files should therefore indicate the data that is actually stored on disk. For example, JSON is a clear-text format, so all data in a JSON file is text. If the Hive type for a field is DATE, then the SerDe converts the data from string (in the data file) to a Hive date. Then the access driver converts the data from a Hive date to an Oracle date. However, if the Hive type for the field is string, then the SerDe does not perform a conversion, and the access driver converts the data from string to an Oracle date. Queries against the external table are faster in the second example, because the access driver performs the only data conversion.
Table 6-5 identifies the data type conversions that ORACLE_HIVE can make when loading data into an external table.
Table 6-5 Supported Hive to Oracle Data Type Conversions
Hive Data Type | VARCHAR2, CHAR, NVARCHAR2, NCHAR, CLOB | NUMBER, FLOAT, BINARY_NUMBER, BINARY_FLOAT | BLOB | RAW | DATE, TIMESTAMP, TIMESTAMP WITH TZ, TIMESTAMP WITH LOCAL TZ | INTERVAL YEAR TO MONTH, INTERVAL DAY TO SECOND |
---|---|---|---|---|---|---|
INT, SMALLINT, TINYINT, BIGINT | yes | yes | yes | yes | no | no |
DOUBLE, FLOAT | yes | yes | yes | yes | no | no |
DECIMAL | yes | yes | no | no | no | no |
BOOLEAN | yes (see footnote 2) | yes | yes (see footnote 3) | yes | no | no |
BINARY | yes | no | yes | yes | no | no |
STRING | yes | yes | yes | yes | yes | yes |
TIMESTAMP | yes | no | no | no | yes | no |
STRUCT, ARRAY, UNIONTYPE, MAP | yes | no | no | no | no | no |
Users can query external tables using the SQL SELECT statement, the same as they query any other table.
Users who query the data on a Hadoop cluster must have READ access in Oracle Database to the external table and to the database directory object that points to the cluster directory. See "About the Cluster Directory".
By default, a query returns no data if an error occurs while the value of a column is calculated. Processing continues after most errors, particularly those thrown while the column values are calculated.
Use the com.oracle.bigdata.erroropt parameter to determine how errors are handled.
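For example, a sketch based on the earlier order examples: instead of returning no data when an ORDER_TOTAL value cannot be converted, the following clause sets the column to NULL for the affected row (the setnull action shown here is an assumption about the desired policy; replace and reject are the other actions used in this chapter):
(TYPE oracle_hive
 ACCESS PARAMETERS (
    com.oracle.bigdata.erroropt: {"action":"setnull", "col":"ORDER_TOTAL"}
 ))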
Oracle Big Data SQL runs exclusively on systems with Oracle Big Data Appliance connected to Oracle Exadata Database Machine. The Oracle Exadata Storage Server Software is deployed on a configurable number of Oracle Big Data Appliance servers. These servers combine the functionality of a CDH node and an Oracle Exadata Storage Server.
The Mammoth utility installs the Big Data SQL software on both Oracle Big Data Appliance and Oracle Exadata Database Machine. The information in this section explains the changes that Mammoth makes to the Oracle Database system.
This section contains the following topics:
Note:
Oracle SQL Connector for HDFS provides access to Hadoop data for all Oracle Big Data Appliance racks, including those that are not connected to Oracle Exadata Database Machine. However, it does not offer the performance benefits of Oracle Big Data SQL, and it is not included under the Oracle Big Data Appliance license. See Oracle Big Data Connectors User's Guide.
The Big Data SQL agent on the database is managed by the clusterware. The agent is registered with the clusterware during Big Data SQL installation, so that it automatically starts and stops with the database. To check the status, you can run the following from the Grid or Clusterware home:
mtactl check bds_databasename_clustername
The common directory contains configuration information that is common to all Hadoop clusters. This directory is located on the Oracle Database system under the Oracle home directory. The oracle file system user (or whichever user owns the Oracle Database instance) owns the common directory. A database directory named ORACLE_BIGDATA_CONFIG points to the common directory.
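For example, you can confirm where the common directory resides on your system by querying the directory object:
SQL> SELECT directory_path
     FROM dba_directories
     WHERE directory_name = 'ORACLE_BIGDATA_CONFIG';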
The Mammoth installation process creates the following files and stores them in the common directory:
bigdata.properties
bigdata-log4j.properties
The Oracle DBA can edit these configuration files as necessary.
The bigdata.properties file in the common directory contains property-value pairs that define the Java class paths and native library paths required for accessing data in HDFS.
These properties must be set:
java.classpath.hadoop
java.classpath.hive
java.classpath.oracle
The following list describes all properties permitted in bigdata.properties.
bigdata.properties
Property | Description |
---|---|
bigdata.cluster.default | The name of the default Hadoop cluster. The access driver uses this name when the access parameters do not specify a cluster. Required. Changing the default cluster name might break external tables that were created previously without an explicit cluster name. |
bigdata.cluster.list | A comma-separated list of Hadoop cluster names. Optional. |
java.classpath.hadoop | The Hadoop class path. Required. |
java.classpath.hive | The Hive class path. Required. |
java.classpath.oracle | The path to the Oracle JXAD Java JAR file. Required. |
java.classpath.user | The path to user JAR files. Optional. |
java.libjvm.file | The full file path to the JVM shared library (such as libjvm.so). Required. |
java.options | A comma-separated list of options to pass to the JVM. Optional. This example sets the maximum heap size to 2 GB, and verbose logging for Java Native Interface (JNI) calls: -Xmx2048m,-verbose=jni |
LD_LIBRARY_PATH | A colon-separated (:) list of directory paths to search for the Hadoop native libraries. Recommended. If you set this option, then do not set java.library.path in java.options. |
Example 6-3 shows a sample bigdata.properties file.
Example 6-3 Sample bigdata.properties File
# bigdata.properties
#
# Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
#
# NAME
#     bigdata.properties - Big Data Properties File
#
# DESCRIPTION
#     Properties file containing parameters for allowing access to Big Data
#     Fixed value properties can be added here
#
java.libjvm.file=$ORACLE_HOME/jdk/jre/lib/amd64/server/libjvm.so
java.classpath.oracle=$ORACLE_HOME/hadoopcore/jlib/*:$ORACLE_HOME/hadoop/jlib/hver-2/*:$ORACLE_HOME/dbjava/lib/*
java.classpath.hadoop=$HADOOP_HOME/*:$HADOOP_HOME/lib/*
java.classpath.hive=$HIVE_HOME/lib/*
LD_LIBRARY_PATH=$ORACLE_HOME/jdk/jre/lib
bigdata.cluster.default=hadoop_cl_1
The bigdata-log4j.properties file in the common directory defines the logging behavior of queries against external tables in the Java code. Any log4j properties are allowed in this file.
Example 6-4 shows a sample bigdata-log4j.properties file with the relevant log4j properties.
Example 6-4 Sample bigdata-log4j.properties File
# bigdata-log4j.properties
#
# Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
#
# NAME
#     bigdata-log4j.properties - Big Data Logging Properties File
#
# DESCRIPTION
#     Properties file containing logging parameters for Big Data
#     Fixed value properties can be added here

bigsql.rootlogger=INFO,console
log4j.rootlogger=DEBUG, file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.logger.oracle.hadoop.sql=ALL, file

bigsql.log.dir=.
bigsql.log.file=bigsql.log
log4j.appender.file.File=$ORACLE_HOME/bigdatalogs/bigdata-log4j.log
The cluster directory contains configuration information for a CDH cluster. Each cluster that Oracle Database will access using Oracle Big Data SQL has a cluster directory. This directory is located on the Oracle Database system under the common directory. For example, a cluster named bda1_cl_1 would have a directory of the same name (bda1_cl_1) in the common directory.
The cluster directory contains the CDH client configuration files for accessing the cluster, such as the following:
core-site.xml
hdfs-site.xml
hive-site.xml
mapred-site.xml (optional)
log4j property files (such as hive-log4j.properties)
A database directory object points to the cluster directory. Users who want to access the data in a cluster must have read access to the directory object.
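For example, a minimal sketch of such a grant; the directory object name and the user name are hypothetical:
-- Give a querying user read access to the cluster directory object.
GRANT READ ON DIRECTORY bda1_cl_1 TO bds_user;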
See Also:
"Providing Remote Client Access to CDH" for a more detailed discussion of Hadoop clients.
The oracle operating system user (or whatever user owns the Oracle Database installation directory) must have the following setup:
READ/WRITE access to the database directory that points to the log directory. These permissions enable the access driver to create the log files, and for the user to read them.
A corresponding oracle operating system user defined on Oracle Big Data Appliance, with READ access in the operating system to the HDFS directory where the source data is stored.
Footnote 1 (Table 6-3): 0 for false, and 1 for true.
Footnote 2 (Table 6-5): FALSE maps to the string FALSE, and TRUE maps to the string TRUE.
Footnote 3 (Table 6-5): FALSE maps to 0, and TRUE maps to 1.